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Estimation of genewise variance arises from two important ap- 
plications in microarray data analysis: selecting significantly differ- 
entially expressed genes and validation tests for normalization of mi- 
croarray data. We approach the problem by introducing a two-way 
nonparametric model, which is an extension of the famous Neyman- 
Scott model and is applicable beyond microarray data. The problem 
itself poses interesting challenges because the number of nuisance pa- 
rameters is proportional to the sample size and it is not obvious how 
the variance function can be estimated when measurements are cor- 
related. In such a high-dimensional nonparametric problem, we pro- 
posed two novel nonparametric estimators for genewise variance func- 
tion and semiparametric estimators for measurement correlation, via 
solving a system of nonlinear equations. Their asymptotic normality 
is established. The finite sample property is demonstrated by simula- 
tion studies. The estimators also improve the power of the tests for de- 
tecting statistically differentially expressed genes. The methodology 
is illustrated by the data from microarray quality control (MAQC) 
project. 

1. Introduction. Microarray experiments are one of widely used tech- 
nologies nowadays, allowing scientists to monitor thousands of gene expres- 
sions simultaneously. One of the important scientific endeavors of microarray 
data analysis is to detect statistically differentially expressed genes for down- 
stream analysis [Cui, Hwang and Qiu (2005), Fan et al. (2004), Fan and Ren 
(2006), Storey and Tibshirani (2003), Tusher, Tibshirani and Chu (2001)]. 
Standard t-test and F-test are frequently employed. However, due to the 
cost of the experiment, it is common to see a large number of genes with 
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a small number of replications. Even in customized arrays where only sev- 
eral hundreds of genes expressions are measured, the number of replications 
is usually limited. As a result, we are facing a high-dimensional statistical 
problem with a large number of parameters and a small sample size. 

Genewise variance estimation arises at the heart of microarray data anal- 
ysis. To select differentially expressed genes among thousands of genes, the 
t-test is frequently employed with a stringent control of type I errors. The 
degree of freedom is usually small due to limited replications. The power 
of the test can be significantly improved if the genewise variance can be 
estimated accurately. In such the t-test becomes basically a z-test. 

A simple genewise variance estimator is the sample variance of replicated 
data, which is not reliable due to a relatively small number of replicated 
genes. They have direct impact on the sensitivity and specificity of t-test 
[Cui, Hwang and Qiu (2005)]. Therefore, novel methods for estimating the 
genewise variances are needed for improving the power of the standard i-test. 

Another important application of genewise variance estimation arises from 
testing whether systematic biases have been properly removed after applying 
some normalization method, or selecting the most appropriate normalization 
technique for a given array. Fan and Niu (2007) developed such validation 
tests (see Section 4), which require the estimation of genewise variance. The 
methods of variance estimation, like pooled variance estimator, and REML 
estimator [Smyth, Michaud and Scott (2005)], are not accurate enough due 
to the small number of replications. 

Due to the importance of genewise variance in microarray data analy- 
sis, conscientious efforts have been made to accurately estimate it. Various 
methods have been proposed under different models and assumptions. It has 
been widely observed that genewise variance is to a great extent related to 
the intensity level. Kamb and Ramaswami (2001) proposed a crude regres- 
sion estimation of variance from microarray control data. Tong and Wang 
(2007) discussed a family of shrinkage estimators to improve the accuracy. 

Let R g i and G g i, respectively, be the intensities of red (Cy3) and green 
(Cy5) channels for the iih replication of the gth gene on a two-color mi- 
croarray data. The log-ratios and log-intensities are computed, respectively, 
as 

Y gi = \Og 2 {Ggi/Rgi) 3XI& X gi = \ 10g 2 ( G gi Rg j ) , 

i = l,...,I,£ = l,...,jV, 

where / is the number of replications for each gene and N is the number 
of genes with replications. For the purpose of estimating genewise variance, 
we assume that there is no systematic biases or the systematic biases have 
been removed by a certain normalization method. This assumption is always 
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made for selecting significantly differentially expressed genes or validation 
test under the null hypothesis. Thus, we have 

with a g denoting the log-ratio of gene expressions in the treatment and con- 
trol samples. Here, (e s i, . . . , e g j) T follows a multivariate normal distribution 
with e g i ~ iV(0, 1) and Corr(e 9 j,e S j) = p when i 7^ j. It is also assumed that 
observations from different genes are independent. Such a model was used 
in Smyth, Michaud and Scott (2005). 

In the papers by Wang, Ma and Carroll (2009) and Carroll and Wang 
(2008), nonparametric measurement-error models have been introduced to 
aggregate the information of estimating the genewise variance: 

Ygi = a g + a(a g )e gi , 

(1) 

coTi(egi,e gi >) = 0, g = l,...,N,i = l,...,I. 

The model is intended for the analysis of the Affymetrix array (one-color 
array) data in which a g represents the expected intensity level, and Ygi is 
the ith replicate of observed expression level of gene g. When it is applied to 
the two-color microarray data as in our setting, in which a g is the relative 
expression profiles between the treatment and control, several drawbacks 
emerge: (a) the model is difficult to interpret as the genewide variance is a 
function of the log-ratio of expression profiles; (b) errors-in- variable methods 
have a very slow rate of convergence for the nonparametric problem and the 
observed intensity information X g i is not used; (c) they are usually hard 
to be implemented robustly and depend sensitively on the distribution of 
a(a g )e g i and the i.i.d. assumption on the noise; (d) in many microarray 
applications, a g = for most g and hence cr(a g ) are the same for most 
genes, which is unrealistic. Therefore, our model (2) below is complementary 
to that of Wang, Ma and Carroll (2009) and Carroll and Wang (2008), with 
focus on the applications to two-color microarray data. 

To overcome these drawbacks in the applications to microarray data and 
to utilize the observed intensity information, we assume that a g i = a(X g i) for 
a smooth function cr(-). This leads to the following two-way nonparametric 
model: 

(2) Ygi = a g + a(Xgi)egi, g = 1, . . . ,N,i = 1, . . . ,1, 

for estimating genewise variance. This model is clearly an extension of the 
Neyman-Scott problem [Neyman and Scott (1948)], in which the genewise 
variance is a constant. The Neyman-Scott problem has many applications 
in astronomy. Note that the number of nuisance parameters {ct g } is pro- 
portional to the sample size. This imposes an important challenge to the 
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nonparametric problem. It is not even clear whether the function er(-) can 
be consistently estimated. 

To estimate the genewise variance in their microarray data analysis, Fan et 
al. (2004) assumed a model similar to (2). But in the absence of other avail- 
able techniques, they had to impose that the treatment effect {a g } is also a 
smooth function of the intensity level so that they can apply nonparametric 
methods to estimate genewise variance [Ruppert et al. (1997)]. However, this 
assumption is not valid in most microarray applications, and the estimator 
of genewise variance incurs big biases unless {a g } is sparse, a situation that 
Fan et al. (2004) hoped. Fan and Niu (2007) approached this problem in 
another simple way. When the noise in the replications is small, that is, 
X g i ~ Xg, where X g is the sample mean for the ^th gene. Therefore, they 
simply smoothed the pair {(X g ,f g )}, where f g = Y^,i=i(^gi ~ j(J ~~ !)• 
This also leads to a biased estimator, which is denoted as q(x). One asks 
naturally whether the function cr(-) is estimable and how it can be estimated 
in the general two-way nonparametric model. 

We propose a novel nonparametric approach to estimate the genewise 
variance. We first study a benchmark case when there is no correlation be- 
tween replications, that is, p = 0. This corresponds to the case with indepen- 
dent replications across arrays [Fan, Peng and Huang (2005), Huang, Wang 
and Zhang (2005)]. It is also applicable to those dealt by the Neyman- 
Scott problem. By noticing E{(Yg, — Y g ) 2 \X g i} is a linear combination of 
a 2 (X g i), we obtain a system of linear equations. Hence, cr 2 (-) can be es- 
timated via nonparametric regression of a proper linear combination of 
{(Xgi ~ Y g ) 2 ,i = 1,...,/} on {X g i}. The asymptotic normality of the es- 
timator is established. In the case that the replication correlation does not 
vanish, the system of equations becomes nonlinear and cannot be analyti- 
cally solved. However, we are able to derive the correlation corrected esti- 
mator, based on the estimator without genewise correlation. The genewise 
variance function and the correlation coefficient of repeated measurements 
are simultaneously estimated by iteratively solving a nonlinear equation. 
The asymptotic normality of such estimators is established. 

Model (2) can be applied to the microarrays in which within-array repli- 
cations are not available. In that case, we can aggregate all the microarrays 
together and view them as a super array with replications [Fan, Peng and 
Huang (2005), Huang, Wang and Zhang (2005)]. In other words, i in (2) 
indexes arrays and p can be taken as 0, namely (2) is the across-array repli- 
cation with p = 0. 

The structure of this paper is as follows. In Section 2, we discuss the 
estimation schemes of the genewise variance and establish the asymptotic 
properties of the estimators. Simulation studies are given in Section 3 to 
verify the performance of our methods in the finite sample. Applications to 
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the data from Microarray Quality Control (MAQC) project are showed in 
Section 4 to illustrate the proposed methodology In Section 5, we give a 
short summary. Technical proofs are relegated to the Appendix. 

2. Nonparametric estimators of genewise variance. 

2.1. Estimation without correlation. We first consider the specific case 
where there is no correlation among the replications Y g \ , . . . , Y g j of the same 
gene g under model (2). This is usually applicable to the across-array repli- 
cation and stimulates our procedure for the more general case with the 
replication correlation. In the former case, we have 

E[(Y gi - F 9 ) 2 |X] = (/ - l) 2 a 2 (X gi )/I 2 + £V (X ffi )// 2 , i = 1, . . . ,1. 

We will discuss in Section 2.2.4 the case that 1 = 2. For I > 2, we have 
/ different equations with / unknowns a 2 (X g i), a 2 (X g 2), ■ ■ ■ ,a 2 (X g i) for a 
given gene g. Solving these / equations, we can express the unknowns in 
terms of {E[(Y" 5 j — Y 3 ) 2 |X]}[ =1 , estimable quantities. Let 

r g = ((Y gl -Y g ) 2 ,...,(Y gI -Y g ) 2 f and <x 2 = (a 2 (X gl ), . . . , a 2 {X gI )) T . 

Then, it can easily be shown that a 2 = BE[r s |X], where B is the coefficient 
matrix: 

B = ((/ 2 -/)I-E)/(/-l)(/-2) 

with I being the I x I identity matrix and E the I x I matrix with all 
elements 1. Define 

^9 = (Z g i, • • ■ , Z g i) T = Br g . 

Then we have 

(3) a 2 (X gi ) = E[Z gt \X]. 

Note that the left-hand side of (3) depends only on X g i, not other variables. 
By the the double expectation formula, it follows that the variance function 
<7 2 (-) can be expressed as the univariate regression 

(4) a 2 (x) = E[Z gi \X gi = x], i = l,... ,1. 

Using the synthetic data {(X g i, Z g i), g = 1, . . . , N} for each given i, we can 
apply the local linear regression technique [Fan and Gijbels (1996)] to obtain 
a nonparametric estimator fjf(x) of c 2 (-). Explicitly, for a given kernel K 
and bandwidth h, 

(5) f, 2 (x)=^W N ,<^^Z gi , i = l,...,I, 
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with 

tt r / \ ,-l r w \ Sn,2 - uSn,1 

W N ,i(u) = h L K(u)- — '- 

JN,2&N,0 — <->jv,1 

where K h (u) = h' l K{ujh) and S N> i = ^ g=1 K h (X gi -x)[(X g i-x)/h] l : whose 
dependence on i is suppressed. Thus, we have I estimators fjf(x), . . . ,fjj(x) 
for the same genewise variance function o~ 2 {-). Each of these / estimators 
fjf (x) is a consistent estimator of a 2 (x). To optimally aggregate those I 
estimators, we need the asymptotic properties of r/(x) = (fj 2 (x), . . . ,fjj(x)) T . 
Denote 

/OO /'OO 
u 2 K(u)du, dx = I K 2 (u)du, 
-OO J — OO 

ai=E[a(X gi )} and a 2 = E[a 2 {X gi )}. 

Assume that X g i are i.i.d. with marginal density fx(') an d £ g i are i.i.d. 
random variables from the standard normal distribution. In the following 
result, we assume that / is fixed, but N diverges. 

Theorem 1 . Under the regularity conditions in the Appendix, for a fixed 
point x, we have 

S -i/2 (r7 _ ( a 2 (a .) + + 0p (h 2 ))e) A iV(0,I), 

provided that h — > and Nh — > 00, where e = (1, 1, . . . , 1) T and 

S = V 1 I + V 2 (E-I) 

with b(x) = ^c K {o- 2 {x))" , 

d K L 4^ , 4 + 4(/-l)(/-3) 2 2 2 



* = iVT^) | 2CJ & + (/-!)(/ -2)2 ^ ^ + (/-!)(/ -2)" 
1 f 4 8 2, n 2(7-3) 2 1 

y2 = iv\ (7^1)2- (*) " (73T)2^ W + (/ _ 1)2(/ _ 2) -2j. 

Note that V 2 is one order of magnitude smaller than V\. Hence, the es- 
timators fjf(x), . . . , ijj(x) are asymptotically independently distributed as 
N(a 2 (x) + b(x), V\). Their dependence is only in the second order. The best 
linear combination of / estimators is 

(6) fj 2 (x) = [fj 2 (x] +rj 2 (x) + --- + fjj(x)]/I 
with the asymptotic distribution 

(7) N{a 2 (x) + b(x), + (1 - l/I)V 2 ). 
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See also the aggregated estimator (16) with p = 0, which has the same 
asymptotic property as the estimator (8). See Remark 1 below for addi- 
tional discussion. 

Theorem 1 gives the asymptotic normality of the proposed nonparametric 
estimators under the presence of a large number of nuisance parameters 
{dg}^ = i- With the newly proposed technique, we do not have to impose any 
assumptions on a g such as sparsity or smoothness. This kind of local linear 
estimator can be applied to most two-color microarray data, for instance, 
customized arrays and Agilent arrays. 



2.2. Variance estimation with correlated replications. 



2.2.1. Aggregated estimator. We now consider the case with correlated 
with-array replications. There is a lot of evidence that correlation among 
within-array replicated genes exists [Smyth, Michaud and Scott (2005), Fan 
and Niu (2007)]. Suppose that within-array replications have a common 
correlation corr (Y g i,Y g j\'X.) = p when i ^ j. Observations across different 
genes or arrays are independent. Then the conditional variance of (Y g i — Y g ) 
can be expressed as 

v a r[(Y gi -Y g )\X] 

(8) = (I-l) 2 a 2 (X gi )/I 2 + 2p < X 9MX 9 k)/I 2 

l<j<k<i, 

+ 2(1 -l)pY j a 2 {X g] )/I 2 -Y,a(X g .{)a(X go )/I 2 . 

This is a complex system of nonlinear equations and the analytic form cannot 
be found. Innovative ideas are needed. 

Using the same notation as that in the previous section, it can be calcu- 
lated that 

E[Z gi \X\ = a 2 {X gt ) -j^E^^fe) 



2 

+ (/_!)(/_ 2 ) ^ po-{X gj )a{X gk ). 



±<j<k<I, 

Taking the expectation with respect to X g j for all j j^i, we obtain 
(9) E[Z gi \X gi = x] = o- 2 (x) - 2paia(x) + pa\ = rf(x), 



where a 1 = E[a(X)]. 
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Here, we can directly apply the local linear approach to all aggregated 
data {(X g i, Zgi)}l^ =l , due to the same regression function (9). Let f)\(-) be 
the local linear estimator of rj 2 (-), based on the aggregated data. Then 



N I / v - \ 

(io) to=EE^(^V E )^ 



3=1 i=l 



with 



Jir I \ L-lr// \ SnI,2~ uSnI,1 

W N {u) = h l K{u)- '- 

ONI,0&NI,2 — J a 



'NI,1 

where Sm,i = Z^3=i ^2i=i Kh{X g i — x)[(X g i — x)/h] 1 . There are two solutions 
to (9): 



(11) a A ( X , p)WW = p &l ± ^2 & 2 _ ffi + ^2 (x)) 

Notice that given the sample X and Y, &a{x, p)^)^ 2 ) are continuous in both 
a; and p. For p < 0, <ta(o;, p)^) should be used since the standard deviation 
should be nonnegative. Since cta(x, p)^ > &a(x, p)^ for every x and p, by 
the continuity of the solution in p, we can only use the same solution when 
p changes continuously. Then ua{x, p)^ should always be used regardless 
of p. From now on, we drop the superscript and denote 



(12) a A (x) = pa l + ^p 2 al-pal + fj 2 A {x). 

This is called the aggregated estimator. Note that in (12), p, o~\ and er(-) are 
all unknown. 

2.2.2. Estimation of correlation. To estimate p, we assume that there 
are J independent arrays ( J > 2). In other words, we observed data from 
(2) independently J times. In this case, the residual maximum likelihood 
(REML) estimator introduced by Smyth, Michaud and Scott (2005) is as 
follows: 



(13) po 



Z^g=l s B,g 2^g=l s W,g 
Z^3=l s B,g + (I ~ 1) 2^3=1 s W,g 



where s% g = I {J - l)" 1 £/ =1 (Y fli ~ Y ^ with Y gj = J" 1 Y gij and Y g = 
£j=i ^aj is the between-arrays variance and s 2 ^ is the within-array 

] 

w 



variance: 



3=1 i=l 
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As discussed in Smyth, Michaud and Scott (2005), the estimator po of p 
is consistent when var(Y^-|X) = o~ g is the same for all i = 1, . . . , I and j = 
1,..., J. However, this assumption is not valid under the model (2) and a 
correction is needed. We propose the following estimator: 

( 14 ) - _ 0-2 2^g=l s B,g 2^g=l s W,g 



The consistency of p is given by the following theorem. 

Theorem 2. Under the regularity condition in the Appendix, the esti- 
mator p of p is y/~N -consistent: 

p-p = P (N~ 1 ' 2 ). 

With a consistent estimator of p, a±, a 2 and <ta( - ) can be solved by the 
following iterative algorithm: 

Step 1. Set fj 2 A {-) as an initial estimate of cr A (-). 
Step 2. With &a{ - ), compute 

N N 

(15) &x = N- 1 Y,*A{Xgi), a 2 = N- 1 Y J ^ A {X gi ), p = Po a 2 /&l 
9=1 9=1 

Step 3. With &i, 02 and p, compute <ja(') using (12). 
Step 4. Repeat steps 2 and 3 until convergence. 

This provides simultaneously the estimators a\, a 2 , p and <ta( - )- From our 
numerical experience, this algorithm converges quickly after a few iterations. 
When the algorithm converges, the estimator o\{x) is given by 



(16) a A (x) = pai + ^p 2 af-paf + rj 2 A (x). 

Note that the presence of multiple arrays is only used to estimate the 
correlation p for the replications. It is not needed for estimating the genewise 
variance function. In the case of the presence of J arrays, we can take the 
average of the J estimates from each array. 

2.2.3. Asymptotic properties. Following a similar idea as the case with- 
out correlation, we can derive the asymptotic property of fj A (x). 

Theorem 3. Under the regularity conditions in the Appendix, for a fixed 
point x, we have 

{V*r 1/2 {fl 2 A (x) - [r, 2 {x) + P(x)] + o P (h 2 )} A N(0, 1), 
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provided that h — >■ and Nh — > oo, with f3(x) = ^-cx(?? 2 (^))" and 

where 

V{ = — % A 2a\x) - 8pa l0 - 3 (x) + C 2 a 2 (x) + C 3 a(x) + C 4 }, 

Vi = jj{D a\x) + D 1( t 3 (x) + D 2 a 2 {x) + D z a(x) + D A } 
with coefficients C 2 , ■ ■ ■ , C4, Dq, . . . , D4 defined in the Appendix. 

The asymptotic normality of o~\{x) can be derived from that of rf A (x). 
More specifically, cr 2 A {x) = (p(r]\(x)) with ip(z) = (pai + yf p 2 o\ — paf + z) 2 . 
The derivative of (/?(•) with respect to z is tp(z) = po\j \J p 2 a\ — paf + z + 1. 
Then, by the delta method, we have 

{V*}- 1/2 (6- 2 A (x) - <p(r, 2 (x) + P(x) + o P {h 2 ))) A N(0, ^ 2 (v 2 (x))). 



Remark 1 . An alternative approach when correlation exists is to apply 
the same correlation correction idea to {X g i, Z g i}^ =1 for every replication 2, 
resulting in the estimator af{x). In this case, it can be proved that the best 
linear combination of the estimator is 

(17) a 2 (x) = [aj(x) + a 2 (x) + ■ ■ • + aj(x)]/I. 

This estimator has the same asymptotic performance as the aggregated es- 
timator. However, we prefer the aggregated estimator due to the following 
reasons: the equation (16) only needs to be solved once by using the algo- 
rithm in Section 2.2.2, all data are treated symmetrically, and fj\(-) can be 
estimated more stably. 

2.2.4. Two replications. The aforementioned methods apply to the case 
when there are more than two replications. For the case 1 = 2, the equations 
for var[(Y 9 j — Yg)|X] collapse into one. In this case, it can be shown using 
the same arguments before that 

(18) vai[{Y gi -Y g )\X gi = x\ = \a 2 {x) + \a 2 -\pa 1 a{x), z = 1,2, 

where a 2 = E[a 2 (X g i)]. In this case, the left-hand side is always equal to 
w W [{Y gl -Y g2 )/2\X gi = x}. 
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Let f] 2 (x) be the local linear estimator of the function on the right-hand 
side by smoothing {{Y gl - Y g2 ) 2 /4}* =1 on {X gl }^ =l and {X g2 }% =l . Then 
the genewise variance is a solution to the following equation: 

(19) a(x) = pa x + - a 2 + 4f) 2 (x). 

The algorithm in Section 2.2.2 can be applied directly. 

3. Simulations and comparisons. In this section, we conduct simulations 
to evaluate the finite sample performance of different variance estimators 
(, 2 (x), f} 2 (x) and a\{x). First, the bias problem of the naive nonparametric 
variance estimator £ 2 (x) is demonstrated. It is shown that this bias issue 
can be eliminated by our newly proposed methods. Then we consider the 
estimators fj 2 (x) and &\{x) under different configurations of the within-array 
replication correlation. 

3.1. Simulation design. In all the simulation examples, we set the num- 
ber of genes N = 2000, each gene having 1 = 3 within-array replications 
and J = 4 independent arrays. For the purpose of investigating the genewise 
variance estimation, the data are generated from model (2). The details of 
simulation scheme are summarized as follows: 

a g : The expression levels of the first 250 genes are generated from the stan- 
dard double exponential distribution. The rest are 0s. These expression 
levels are the same over 4 arrays in each simulation, but may vary over 
simulations. 

X: The intensity is generated from a mixture distribution: with probability 
0.7 from the distribution 0.0004(x — 6) 3 /(6 < x < 16) and 0.3 from the 
uniform distribution over [6,16]. 

e: e g i is generated from the standard normal distribution. 
cr 2 (-): The genewise variance function is taken as 

a 2 (x) = 0.15 + 0.015(12 - x) 2 I{x < 12}. 

The parameters are taken from Fan, Peng and Huang (2005). The kernel 
function is selected as |^(1 — M 3 ) 3 ^! 21 ! — !)■ I n addition, we fix the band- 
width h = 1 for all the numerical analysis. 

For every setting, we repeat the whole simulation process for T times and 
evaluate the estimates of cr 2 (-) over K = 101 grid points {xk}£ = i on the 
interval [6, 16]. For the kth grid point, we define 

T 

B k = a 2 (x k ) - a 2 (x k ) with a 2 (x k ) = T" 1 ^ 6- 2 (x k ), 

t=i 

S k = T- 1 ^t(^)-a 2 (x k )] 2 , 
t=\ 
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and MSEfc = B 2 + S k . Let /(•) be the density function of intensity X. Let 

K K K K 

Bias 2 = B 2 k f(x k ) I £ f(x k ), VAR = £ S fc /(x fc ) / ]T /(a*) 
fc=i fc=i fe=i fe=i 

and 

K K 
MISE = MSE k f(x k ) I /(*k) 

k=l k=l 

be the integrated squared bias (Bias 2 ), the integrated variance (VAR), and 
the integrated mean squared error (MISE) of the estimate <5" 2 (-)> respectively. 
For the tth simulation experiment, we define 

K K 

lSE t = £> t 2 (z fe ) - a 2 (x k )) 2 f(x k ) I /C**) 

k=l k=l 

be the integrated squared error for the tth simulation. 



3.2. The bias of naive nonparametric estimator. A naive approach is to 
regard a g in (2) as a smooth function of X g i, namely, a g = a(X g i). The 
function a(-) can be estimated by a local linear regression estimator, result- 
ing in an estimated function &(■). The squared residuals {r 2 gi }^ =l is then 

further smoothed on {X g i}^ =1 to obtain an estimate £ 2 (x) of the variance 

function <r 2 (-), where r g i = Y g i — a(X g i) [Ruppert et al. (1997)]. 

To provide a comprehensive view of the performances of the naive and 
the new estimators, we first compare the performances of £ 2 (x) and f/ 2 (x) 
under the smoothness assumption of the gene effect a g . Data from the naive 
nonparametric regression model is also generated with 

a(x)=exp^- i _^ 1 _ i3)2 ^/{12 < x < 14}. 

This allows us to understand the loss of efficiency when a g is continuous 
in Xgi. This usually does not occur for microarray data, but can appear in 
other applications. Note that a(-) is zero in most of the region and thus is 
reasonably sparse. Here, the number of simulations is taken to be T = 100. 
The data is generated with the assumption that p = 0, in which case the 
variance estimators f) 2 {x) and cr\(x) have the same performance (see also 
Table 2 below). Thus, we only report the performance of i) 2 (x). 

In Table 1, we report the mean integrated squared bias (Bias 2 ), the mean 
integrated variance (VAR), and the mean integrated squared error (MISE) 
of £ 2 (x) and f) 2 (x) with and without the smoothness assumption on the 
gene effect a g . From the left panel of Table 1, we can see that when the 
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Table 1 

Mean integrated squared bias (Bias 2 ), mean integrated variance (VAR), mean integrated 
squared error (MISE) over 100 simulations for variance estimators £ 2 {x) and f] 2 (x). Two 
different gene effect functions a(-) are implemented. All quantities are multiplied by 1000 







Smooth gene effect 




Nonsmooth gene 


effect 




Bias 2 


VAR 


MISE 


Bias 2 


VAR 


MISE 




0.01 
0.57 


0.14 
0.24 


0.15 
0.80 


16.00 
0.00 


1.47 
0.22 


17.47 
0.23 



Table 2 

Mean integrated squared bias (Bias 2 ), mean integrated variance (VAR), mean integrated 
squared error (MISE) over 1000 simulations for different variance estimators f/ 2 (x) and 
oq{x). Seven different correlation schemes are simulated: p= — 0.4, p = — 0.2, p = 0, 



p 


= 0.2, p 


= 0.4, p = 0.6 


and p — O.f 


]. All quantities 


are multiplied by 1000 














P 








-0.4 


-0.2 





0.2 


0.4 


0.6 


0.8 


Bias 2 


fj 2 (x) 


5.93 


1.48 


0.00 


1.48 


5.91 


13.31 


23.67 




&\{x) 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 




a 2 Q (x) 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


VAR 


n 2 (x) 


0.44 


0.33 


0.24 


0.16 


0.10 


0.05 


0.02 




a\(x) 


0.27 


0.25 


0.24 


0.22 


0.20 


0.19 


0.20 




&o{x) 


0.27 


0.25 


0.24 


0.22 


0.20 


0.18 


0.23 


MISE 


f) 2 (x) 


6.37 


1.81 


0.24 


1.64 


6.01 


13.37 


23.69 




a 2 A (x) 


0.27 


0.25 


0.24 


0.22 


0.21 


0.19 


0.20 




»o(x) 


0.27 


0.25 


0.24 


0.22 


0.20 


0.18 


0.24 



smoothness assumption is valid, the estimator £ 2 (x) outperforms fj 2 (x). The 
reason is that the mean function a{X g i) depends on the replication and is 
not a constant. Therefore, model (2) fails and fj 2 (x) is biased. One should 
compare the results with those on the second row of the right panel where the 
model is right for fj 2 (x). In this case, fj 2 (x) performs much better. Its variance 
is about 3/2 as large as the variance in the case that mean is generated from 
a smooth function a(X g i). This is expected. In the latter case, to eliminate 
a g , the degree of freedom reduces from 1 = 3 to 2, whereas in the former 
case, a(Xgi) can be estimated without losing the degree of freedom, namely 
the number of replications is still 3. The ratio 3/2 is reflected in Table 1. 
However, when the smoothness assumption does not hold, there is serious 
bias in the estimator £ 2 (x), even though that a g is still reasonably sparse. 
The bias is an order of magnitude larger than those in the other situations. 
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To see how variance estimators behave, we plot typical estimators £ 2 (x) 
and fj 2 (x) with median ISE value among 100 simulations in Figure 1. The 
solid line is the true variance function while the dotted and dashed lines rep- 
resent £ 2 (x) and f] 2 (x), respectively. On the left panel of Figure 1, we can 
see that estimator £ 2 (x) outperforms the estimator f) 2 (x) when the smooth- 
ness assumption is valid. The region where the biases occur has already 
been explained above. However, £ 2 (x) will generate substantial bias when 
the nonparametric regression model does not hold, and at the same time, 
our nonparametric estimator fj 2 (x) corrects the bias very well. 

3.3. Performance of new estimators. In this example, we consider the 
setting in Section 3.1 that the smoothness assumption of the gene effect a g 
is not valid. For comparison purpose only, we add an oracle estimator ctq(x) 
in which we assume that o~i, a<i and p are all known. We now evaluate the 
performance of the estimators fj 2 (x), cr\(x) and Oq{x) when the correlation 
between within-array replications varies. To be more specific, seven different 
correlation settings are considered: p = —0.4,-0.2,0, 0.2, 0.4, 0.6, 0.8, with 
p = representing across-array replications. In this case, we increase the 
number of simulations to T = 1000. Again, we report Bias 2 , VAR and MISE 
of the three estimators for each correlation setting in Table 2. When p = 0, 
all the three estimators give the same bias and variance. This is consistent 
with our theory. We can see clearly from the table that, when p ^ 0, the 
estimator d\(x) produces much smaller biases than fj 2 (x). In fact, when \p\ 
as small as 0.2, the bias of fj 2 (x) already dominates the variance. 

It is worth noticing that the performance of Oq(x) and a\{x) are almost 
always the same, which indicates that our algorithm for estimating p, a\ 




Fig. 1. Variance estimators £ 2 {x) and rf(x) with median performance when different 
gene effect function a(-) are implemented. Left panel: smooth a(-) function. Right panel: 
nonsmooth a(-) function. 
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and o"2 is very accurate. To see this more clearly, the squared bias, variance 
and MSE of the estimator p, o\ and 02 in a\{x) under the seven correlation 
settings are reported in Table 3. Here, the true value of 04 and Gi is 0.4217 
and 0.1857. For example, when p = 0.8, the bias of p is less than 0.002 
for dr\(x), which is acceptable because the convergence threshold in the 
algorithm is set to be 0.001. 

In Figure 2, we render the estimates fj 2 (x) and a\{x) with the median 
ISE under four different correlation settings: p = —0.4, p = 0, p = 0.6 and 
p = 0.8. We omit the other correlation schemes since they all have similar 
performance. The solid lines represent the true variance function. The dotted 
lines and dashed lines are for fj 2 (x) and a\(x), respectively. For the case p = 
0, the two estimators are indistinguishable. When p < 0, fj 2 (x) overestimates 
the genewise variance function, whereas when p > 0, it underestimates the 
genewise variance function. 

4. Application to human total RNA samples using Agilent arrays. Our 

real data example comes from Microarray Quality Control (MAQC) project 
[Patterson et al. (2006)]. The main purpose of the original paper is on com- 
parison of reproducibility, sensitivity and specificity of microarray measure- 
ments across different platforms (i.e., one-color and two-color) and testing 
sites. The MAQC project use two RNA samples, Stratagene Universal Hu- 
man Reference total RNA and Ambion Human Brain Reference total RNA. 
The two RNA samples have been assayed on three kinds of arrays: Agilent, 
CapitalBio and TeleChem. The data were collected at five sites. Our study 
focuses only on the Agilent arrays. At each site, 10 two-color Agilent mi- 
croarrays are assayed with 5 of them dye swapped, totaling 30 microarrays. 

Table 3 

Squared bias, variance and MSE of p, <ti and 01 in the estimate a\(x). 
All quantities are multiplied by 10 6 



P 







-0.4 


-0.2 





0.2 


0.4 


0.6 


0.8 


p 


Bias 2 


0.07 


0.04 


0.01 


0.00 


0.00 


0.00 


3.90 




VAR 


7.90 


16.91 


28.65 


36.17 


35.68 


27.21 


20.44 




MSE 


7.97 


16.95 


28.66 


36.17 


35.68 


27.21 


24.35 




Bias 2 


0.24 


0.23 


0.19 


0.14 


0.11 


0.05 


2.47 




VAR 


11.65 


11.52 


11.79 


12.46 


13.64 


15.55 


18.66 




MSE 


11.89 


11.75 


11.99 


12.60 


13.75 


15.59 


21.12 




Bias 2 


0.14 


0.14 


0.12 


0.09 


0.08 


0.05 


0.67 




VAR 


10.34 


10.17 


10.45 


11.12 


12.24 


13.96 


16.16 




MSE 


10.47 


10.31 


10.57 


11.20 


12.32 


14.00 


16.83 
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4.1. Validation test. In the first application, we revisit the validation test 
as considered in Fan and Niu (2007). For the purpose of the validation tests, 
we use gProcessedSignal and rProcessedSignal values from Agilent Feature 
Extraction software as input. We follow the preprocessing scheme described 
in Patterson et al. (2006) and get 22,144 genes from a total of 41,675 noncon- 
trol genes. Among those, 19 genes with each having 10 replications are used 
for validation tests. Under the null hypothesis of no experimental biases, a 
reasonable model is 

(20) Y g i = a g + egi, e gi ~ N(0, a 2 g ), i = 1, . . . ,I,g = 1, . . . , G. 

We use the notation G to denote the number of genes that have I replica- 
tions. For our data, G = 19 and / = 10. Note that G can be different from 
N, the total number of different genes. The validation test statistics in Fan 
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and Niu (2007) include weighted statistics 

r i =e{b^ -^) 2 h 2 X T*=i:{iy*-* a \/°* 

9=1 L i=l J 9=1 U=l 

and unweighted test statistics 
f G 7 

U=i i=i 

t G I 

T ±= ££i^ 

l g=l i=l 

where A/ = \J2I{I — l)/ir and k| = var(^[ =1 |e 9 j — e g \/a g ). Under the null 
hypothesis, the test statistic T\ is \ 2 distributed with degree of freedom 
(J — 1)G and T2,T3 and T4 are all asymptotically normally distributed. As 
a result, the corresponding p-values can be easily computed. 

Here, we apply the same statistics T%, T2, T3 and T4 but we replace the 
pooled sample variance estimator by the aggregated local linear estimator 

/ 1 
a 2 g = Y, a 2 A (X gi )f(X gt ) I £ f(Xgi), 

i=l i=l 

where / is the estimated density function of X g {. The difference between the 
new variance estimator and the simple pooled variance estimator is that we 
consider the genewise variance as a nonparametric function of the intensity 
level. The latter estimator may drag small variances of certain arrays to 
much higher levels by averaging, resulting in a larger estimated genewise 
variance and smaller test statistics or bigger p-values. 

In the analysis here, we first consider all thirty arrays. The estimated 
correlation among replicated genes is p = 0.69. The p- values based on the 
newly estimated genewise variance are depicted in Table 4. As explained in 
Fan and Niu (2007), T4 is the most stable test among the four. It turns out 
that none of the arrays needs further normalization, which is the same as 
Fan and Niu (2007). Furthermore, we separate the analysis into two groups: 
the first group using 15 arrays without dye-swap, which has the estimated 
correlation p = 0.66, and the second group using 15 arrays with dye-swap, 
resulting in an estimated correlation p = 0.34. The p- values are summarized 
in Table 5. Results show that array AGL-2-D3 and array AGL-2-D5 need 
further normalization if 5% significance level applies. The difference is due 
to decreased estimated p for the dye swap arrays and p-values are sensitive 
to the genewise variance. We also did analysis by separating data into 6 
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Table 4 

Comparison of p -values for T\,...,Ti for MAQC project data 
considering all 30 arrays together 



p- values 



Slide name 


2i 


T 2 


T 3 


T 4 


AGL-1-C1 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-C2 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-C3 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-C4 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-C5 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-D1 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-D2 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-D3 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-D4 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-D5 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-2-C1 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-2-C2 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-2-C3 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-2-C4 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-2-C5 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-2-D1 


1.0000 


0.9999 


0.9996 


0.9999 


AGL-2-D2 


0.8387 


0.9011 


0.8953 


0.9182 


AGL-2-D3 


0.3525 


0.1824 


0.3902 


0.1905 


AGL-2-D4 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-2-D5 


0.8820 


0.8070 


0.8848 


0.7952 


AGL-3-C1 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-C2 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-C3 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-C4 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-C5 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-D1 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-D2 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-D3 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-D4 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-D5 


1.0000 


1.0000 


1.0000 


1.0000 



groups: with and without dye swap, and three sites of experiments. Due to 
the small sample size, the six estimates of p range from 0.08 to 0.74, and we 
also find that array AGL-2-D3 needs further normalization. 

4.2. Gene selection. To detect the differentially expressed genes, we fol- 
low the filter instruction and get 19,802 genes out of 41,000 unique noncon- 
trol genes as in Patterson et al. (2006), that is, 1=1. The dye swap result 
was averaged before doing the one-sample i-test. Thus, at each site, we have 
five microarrays. 
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Table 5 

Comparison of p-values for Ti, . . . , T4 for MAQC project data 
considering the arrays with and without dye-swap separately 



p-values 



Slide name 


Ti 


T 2 


T 3 


T 4 


AGL-1-C1 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-C2 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-C3 


1.0000 


1.0000 


0.9999 


1.0000 


AGL-1-C4 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-C5 


1.0000 


1.0000 


0.9999 


1.0000 


AGL-1-D1 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-D2 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-D3 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-D4 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-1-D5 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-2-C1 


1.0000 


1.0000 


0.9943 


1.0000 


AGL-2-C2 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-2-C3 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-2-C4 


0.0152 


0.9493 


0.3931 


0.9136 


AGL-2-C5 


1.0000 


1.0000 


0.8060 


1.0000 


AGL-2-D1 


0.7806 


0.8074 


0.6622 


0.6584 


AGL-2-D2 


0.2170 


0.2984 


0.1651 


0.2217 


AGL-2-D3 


0.0002 


0.0000 


0.0001 


0.0000 


AGL-2-D4 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-2-D5 


0.1236 


0.0662 


0.0669 


0.0300 


AGL-3-C1 


1.0000 


1.0000 


0.9996 


1.0000 


AGL-3-C2 


1.0000 


1.0000 


0.9988 


1.0000 


AGL-3-C3 


1.0000 


1.0000 


0.9977 


1.0000 


AGL-3-C4 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-C5 


1.0000 


1.0000 


0.9999 


1.0000 


AGL-3-D1 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-D2 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-D3 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-D4 


1.0000 


1.0000 


1.0000 


1.0000 


AGL-3-D5 


1.0000 


1.0000 


1.0000 


1.0000 



For each site, significant genes are selected based on these 5 dye-swaped 
average arrays. For all N = 19,802 genes, there are no within-array replica- 
tions. However, model (2) is still reasonable, in which i indexes the array. 
Hence, the "within-array correlation" becomes "between-array correlation" 
and is reasonably assumed as p = 0. 

In our nonparametric estimation for the variance function, all the 19,802 
genes are used to estimate the variance function, which gives us enough 
reason to believe that the estimator cj^(x) is close to the inherent true 
variance function cr 2 (x). 
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We applied both the t-test and z-test to each gene to see if the logarithm 
of the expression ratio is zero, using the five arrays collected at each lo- 
cation. The number of differentially expressed genes detected by using the 
two different tests under three Fold Changes (FC) and four significant levels 
are given in Table 6. Large numbers of genes are identified as differentially 
expressed, which is expected when comparing a brain sample and a tissue 
pool sample [Patterson et al. (2006)]. We can see clearly that the z-test as- 
sociated with our new variance estimator cr\(x) leads to more differentially 
expressed genes. For example, at site 1, using a = 0.001, among the fold 
changes at least 2, t-test picks 8231 genes whereas z-test selects 8875 genes. 
This gives an empirical power increase of (8875 — 8231)/19,802 ~ 3.25% in 
the group with observed fold change at least 2. 

To verify the accuracy of our variance estimation in the z-test, we compare 
the empirical power increase with the expected theoretical power increase. 
The expected theoretical power increase is computed as 

(21) &ve{P z (fi g /a g ) - Pt^AVg/Vg)}, 

taking the average of power increases across all \x g ^ 0. However, in the ab- 
sence of the availability, we replace \i g by its estimate, which is the sample 
average of n = 5 observed log-expression ratios. Table 7 depicts the results 
at three different sites, in which the columns "Theo" refer to the expected 
theoretical power increase defined by (21), with fj, g replaced by Y g and a g re- 
placed by its estimate from the genewise variance function, and the columns 
"Emp" refer to the empirical power increase. 

There are two things worth noticing here. First, for expected theoretical 
power increase, we use the sample mean Y g = fj, g + e g instead of the real gene 
effect which is not observable, so it inevitably involves the error term e g . 

Table 6 

Comparison of the number of significantly differentially expressed genes 



p < 0.05 p < 0.01 p < 0.005 p< 0.001 







t-test 


z-test 


t-test 


z-test 


t-test 


z-test 


t-test 


z-test 


Agilent 1 


FC> 1.5 


12692 


12802 


12464 


12752 


12313 


12722 


11744 


12646 




FC>2 


8802 


8879 


8654 


8872 


8556 


8869 


8231 


8858 




FC>4 


3493 


3493 


3431 


3493 


3376 


3493 


3231 


3493 


Agilent 2 


FC> 1.5 


12282 


12678 


11217 


12587 


10502 


12536 


8270 


12421 




FC>2 


8644 


8877 


7908 


8875 


7452 


8861 


6125 


8828 




FC>4 


3600 


3649 


3188 


3649 


2964 


3649 


2422 


3649 


Agilent 3 


FC>1.5 


12502 


12692 


11994 


12576 


11694 


12519 


10788 


12374 




FC>2 


8689 


8832 


8344 


8810 


8150 


8800 


7591 


8762 




FC>4 


3585 


3603 


3378 


3602 


3278 


3602 


2985 


3600 
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Table 7 

Comparison of expected theoretical and empirical power difference (in percentage) 





a = 


0.05 


a = 


0.01 


a = 


0.005 


a = 


0.001 


Theo 


Emp 


Theo 


Emp 


Theo 


Emp 


Theo 


Emp 


Agilent 1 


2.52 


0.61 


6.08 


3.75 


8.06 


5.59 


13.66 


11.74 


Agilent 2 


4.03 


7.56 


10.11 


17.61 


13.61 


22.86 


23.75 


37.63 


Agilent 3 


3.02 


2.56 


7.14 


7.39 


9.42 


10.19 


15.94 


18.18 


Average 


3.19 


3.58 


7.78 


9.58 


10.36 


12.88 


17.79 


22.51 



Second, the power functions P z (/u) and P*(/i) depend sensitively on \i and the 
tails of the assumed distribution. Despite of these, the expected theoretical 
and empirical power increases are in the same bulk and the averages are very 
close. This provides good evidence that our genewise variance estimation has 
an acceptable accuracy. 

We also apply SIMEX and permutation SIMEX methods in Carroll and 
Wang (2008) to the MAQC data, to illustrate its utility. As mentioned in the 
Introduction, their model is not really intended for the analysis of two-color 
microarray data. Should we only use the information on log-ratios (Y), the 
model is very hard to interpret. In addition, one might question why the 
information on X (observed intensity levels) is not used at all. Nevertheless, 
we apply the SIMEX methods of Carroll and Wang (2008) to only the log- 
ratios Y in the two-color data and produce similar tables to the Tables 6 
and 7. 

From the results, we have the following understandings. First, all the num- 
bers for 2-test in Tables 8 and 9 at all significance levels are approximately 
the same. In fact, the p-values are very small so that numbers at different 
significance levels are about the same. That indicates that both SIMEX and 
permutation SIMEX method are tending to estimate genewise variance very 
small, making the test statistics large for all the time. On the other hand, 
our method estimates the genewise variance moderately so that the num- 
bers are not exactly the same for different significance levels. Second, in the 
implementation, we found that the SIMEX and permutation SIMEX is com- 
putationally expensive (more than one hour) while our method only takes 
a few minutes. Third, from Tables 10 and 11 we can see that the expected 
theoretical power increase and the empirical ones are reasonably close, which 
are in lines with our method. 

5. Summary. The estimation of genewise variance function is motivated 
by the downstream analysis of microarray data such as validation test and 
selecting statistically differentially expressed genes. The methodology pro- 
posed here is novel by using across-array and within- array replications. It 
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Table 8 

Comparison of the number of significantly differentially expressed genes using SIMEX 

method 



p < 0.05 p < 0.01 p < 0.005 p < 0.001 







t-test 


z-test 


t-test 


z-test 


t-test 


z-test 


t-test 


z-test 


Agilent 1 


FC> 1.5 


12692 


12820 


12464 


12820 


12313 


12820 


11744 


12820 




FC>2 


8802 


8879 


8654 


8879 


8556 


8879 


8231 


8879 




FC>4 


3493 


3493 


3431 


3493 


3376 


3493 


3231 


3493 


Agilent 2 


FC>1.5 


12282 


12721 


11217 


12721 


10502 


12721 


8270 


12721 




FC>2 


8644 


8878 


7908 


8878 


7452 


8878 


6125 


8878 




FC>4 


3600 


3649 


3188 


3649 


2964 


3649 


2422 


3649 


Agilent 3 


FC> 1.5 


12502 


12760 


11994 


12760 


11694 


12760 


10788 


12760 




FC>2 


8689 


8836 


8344 


8836 


8150 


8836 


7591 


8836 




FC>4 


3585 


3603 


3378 


3603 


3278 


3603 


2985 


3603 



Table 9 

Comparison of the number of significantly differentially expressed genes using 

permutation SIMEX 



p < 0.05 p<0.01 p < 0.005 p< 0.001 







t-test 


z-test 


t-test 


z-test 


t-test 


z-test 


t-test 


z-test 


Agilent 1 


FOl.5 


12692 


12820 


12464 


12820 


12313 


12820 


11744 


12820 




FC>2 


8802 


8879 


8654 


8879 


8556 


8879 


8231 


8879 




FC>4 


3493 


3493 


3431 


3493 


3376 


3493 


3231 


3493 


Agilent 2 


FC> 1.5 


12282 


12721 


11217 


12721 


10502 


12721 


8270 


12721 




FC>2 


8644 


8878 


7908 


8878 


7452 


8878 


6125 


8878 




FC>4 


3600 


3649 


3188 


3649 


2964 


3649 


2422 


3649 


Agilent 3 


FC>1.5 


12502 


12760 


11994 


12760 


11694 


12760 


10788 


12760 




FC>2 


8689 


8836 


8344 


8836 


8150 


8836 


7591 


8836 




FC>4 


3585 


3603 


3378 


3603 


3278 


3603 


2985 


3603 



Table 10 

Comparison of expected theoretical and empirical power difference using SIMEX method 

(in percentage) 





a = 


0.05 


a — 


0.01 


a = 


0.005 


a = 


0.001 


Theo 


Emp 


Theo 


Emp 


Theo 


Emp 


Theo 


Emp 


Agilent 1 


2.43 


2.06 


7.17 


5.42 


10.30 


7.34 


20.71 


13.44 


Agilent 2 


7.16 


3.41 


19.20 


12.06 


26.17 


16.90 


43.46 


30.42 


Agilent 3 


4.18 


2.88 


11.71 


7.38 


16.45 


9.89 


31.38 


17.57 


Average 


4.59 


2.78 


12.69 


8.29 


17.64 


11.38 


31.85 


20.48 
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Table 11 

Comparison of expected theoretical and empirical power difference using permutation 
SIMEX method (in percentage) 





a = 


0.05 


a — 


0.01 


a = 


0.005 


a = 


0.001 


Theo 


Emp 


Theo 


Emp 


Theo 


Emp 


Theo 


Emp 


Agilent 1 


1.89 


2.86 


5.66 


6.43 


8.19 


8.59 


16.75 


15.07 


Agilent 2 


4.84 


7.37 


13.44 


17.22 


18.97 


22.50 


36.90 


37.26 


Agilent 3 


2.89 


4.91 


8.34 


10.13 


11.87 


13.11 


23.44 


21.31 


Average 


3.20 


5.05 


9.15 


11.26 


13.01 


14.74 


25.70 


24.55 



does not require any specific assumptions on a g such as sparsity or smooth- 
ness, and hence reduces the bias of the conventional nonparametric estima- 
tors. Although the number of nuisance parameters is proportional to the 
sample size, we can estimate the main interest (variance function) consis- 
tently. By increasing the degree of freedom largely, both the validation tests 
and z-test using our variance estimators are more powerful in identifying 
arrays that need to be normalized further and more capable of selecting 
differentially expressed genes. 

Our proposed methodology has a wide range of applications. In addition 
to the microarray data analysis with within-array replications, it can be also 
applied to the case without within-array replications, as long as the model 
(2) is reasonable. Our two-way nonparametric model is a natural extension 
of the Neyman-Scott problem. Therefore, it is applicable to all the problems 
where the Neyman-Scott problem is applicable. 

There are possible extensions. For example, the SIMEX idea can be ap- 
plied on our model in order to take into account the measurement error. 
We can also make adaptations to our methods when we have a prior cor- 
relation structure among replications other than the identical correlation 
assumption. 

APPENDIX 

The following regularity conditions are imposed for the technical proofs: 

1. The regression function <J 2 (x) has a bounded and continuous second 
derivative. 

2. The kernel function K is a bounded symmetric density function with a 
bounded support. 

3. h^0,Nh^oo. 

4. E[<7 8 (X)] exists and the marginal density fx{~) is continuous. 

We need the following conditional variance-covariance matrix of the ran- 
dom vector Zi g in our asymptotic study. 
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Lemma 1. Let ft be the variance-covariance matrix of Z 5 conditioning 
on all data X. Then, respectively, the diagonal and off- diagonal elements are 



(22) 



2a A (X gi ) + (j _ i)2(j _ 2)2 ^2 a2 ( X 9k) a2 ( X gb 



+ ( i_T)(j-V a2(Xgt) ^ cj2( ^ ) ' * = x ' • • • ' j ' 



2 



+ (/_l)2(/_ 2 )2 E ^P^VP^) 



(23) 



k^l 
k,l^i,j 

- (/ _ i)2 (/ _ 2) E ^(^*)(^(^) + Ax 9j )), 

i,3 = 1, ---J, i 7^3- 

Proof. Let A be the variance-covariance matrix of r g conditioning on 
all data X. By direct computation, the diagonal elements are given by 

^ = var[(F 5J -F 9 ) 2 |X] 
(24) = ^j^a\X gi ) + ^^£a 2 (A-y 2 (A>) + |rE^*) 

kj^i k^i 
+ Jl E ^OM^S*). i=l,...,I, 

and the off-diagonal elements are given by 
= cov{[(Y gi - Y g )\ (Y gj - y,) 2 ]|X} 



(25) 



2(/ j4 1)2 k 4 (^) + ^ 4 (^')] + A{1 o-\X gi )o\X g3 ) 
] Y.° 2 i X 9 k){o- 2 (X gi ) + o 2 (X gj )) 



J 4 

+^ E ^ 2 (^v 2 (^)+^E^(^)- 

k,lj^i,j;l<k kj^i,j 

Using Q = BAB T , we can obtain the result by direct computation. □ 
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The proofs for Theorems 1 and 3 follow a similar idea. Since Theorem 1 
does not involve a lot of coefficients, we will show the proof of Theorem 1 
and explain the difference in the proof of Theorem 3. 

Proof of Theorem 1. First of all, the bias of rjf(x) comes from the 
local linear approximation. Since {{X g i, Z g i)}^ =1 is an i.i.d. sequence, by (4) 
and the result in Fan and Gijbels (1996), it follows that 

E{77?(aO|X} = a 2 (x) + b(x) + o P {h 2 ). 

Similarly, the asymptotic variance of rjf{x) also follows from Fan and Gijbels 
(1996). 

We now prove the off-diagonal elements in matrix var[T7|X] 

(26) C ov[(f,t(x),f)*(x))\X] = V 2 + o P (l/N). 

Recalling that fjf(x) = Y,g=i W N:i ((X gi - x)/h)Z gi , we have 

cov[(^^),^(x))|X]=^W A r I i(^^)^(^ £ ) cov[(Z gi ,Z gj )\X}. 

The equality follows by the fact that cov[(Z 5 j, Z ff /j)|X] = when g ^ g' . 
Recall Viij = cov[(Z g i, Z g j)\X] and define RN,g = N ■ WNj((X g j — x)/h)VLij. 
Thus, 

(27) iV .cov[(^(x),^(x))|X]=^^/^^V i v, 9 . 

9=1 

The right-hand side of (27) can be seen as local linear smoother of the syn- 
thetic data {{X g i, i?jv,g)}^Li- Although R,N,g involves N at the first glance, 
its conditional expectation E[Rjy tg \X g i = x] and conditional variance 
vai[RN ig \X g i = x] do not grow with N. Since {(X g i, Rn j9 )}^ =1 is an i.i.d. 
sequence, by the results in Fan and Gijbels (1996), we obtain 

N ■ cov[(7?f (x),7?f (x))|X] = E[R N , g \X gi =x] + o P (l). 

To calculate E[R]y tg \X g i = x], we apply the approximation W^,i(u) = K(u)(l + 
op(l)) / '(N hfx (x)) in the example of Fan and Gijbels [(1996), page 64] and 
have the following arguments 



y< 



E 



N -mMx-) hKh ^- x) ^ x - 



(1 + op(1)) 



(fx(x)) 1 J K(u)Qij\x 3Z =x(x + hu,s)f x (x + hu)duds + op(l) 
NV 2 + o P {l), 
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where s represents all the integrating variables corresponding to X g \ , . . . , X g j 
except X g i and X g j. That justifies (26). 

To prove the multivariate asymptotic normality 

(28) 5T 1/2 (*? - (cx 2 (x) + b(x) + o P (h 2 ))e) A N(0,h), 

we employ Cramer- Wold device: for any unit vector a = (ai, . . . ,aj) T in R , 

Denote by Q g>i = W N ,i((X gi - x)/h)(Z gi - a 2 (X gi )) and Q g = Ya=i aiQg,i- 
Note that the sequence {Q g }^ =1 is i.i.d. distributed. To show the asymptotic 
normality of F* , it is sufficient to check Lyapunov's condition: 

E g liE[|Q g | 4 |X] 
hm £ = 0. 



N- 



°°(ELe[|Q 9 | 2 |x])2 



To facilitate the presentation, we first note that sequences {Q g ,i\^ = i are i.i.d. 
and satisfy Lyapunov's condition for each fixed i. Denote 5^ i = 

£f = iE[|Q 5ii | 2 |X]. And recall that 5 2 Ni = var[r? 2 (x)|X] = Op((Nh)~ 1 ). Let 
c* be a generic constant which may vary from one line to another. We have 
the following approximation: 

N 

^E[|Q 5ii | 4 |X] = c*N' 3 E{Kt(X gi - x)[{Z gi - a 2 {X gi )f\Tt]}{l + 0p (l)) 

9=1 

= o P ((Nhy 3 ). 

Therefore, E^=i E[|Q 9 ,j| 4 |X] =o(8'^ !i ). By the marginal Lyapunov condi- 
tions, we have the following inequality: 

N IN 

^E[Q 4 |X] < c* ^ ^ E[|Q 9ii | 4 |X] = c*I ■ o P ((Nh)~ 2 ) = o P ((Nh)- 2 ). 

9=1 »=1 9=1 

For the denominator, we have the following arguments: 

N N N 

^E[|Q g | 2 |X] = ^a 2 ^E[Q 2 i4 |X] + ^a i a i ^E[g 5 , i g 5j |X] 

9=1 i 9=1 i+3 9=1 

= a i var ^ ( x ) I X l + X] ai a i COV t(^ 2 ( X ) ' Vj ( x ) ) I X] 

= OpdNhy 1 ) + OpiN" 1 ) 
= Op((Nh)~ 1 ). 
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Note that the second to last equality holds by the asymptotic conditional 
variance-covariance matrix S. Therefore Lyapunov's condition is justified. 
That completes the proof. □ 



Proof of Theorem 2. First of all, for each given g, 

Es B, g =Ivar(Y g j) = a 2 +p(I- l)a\. 

Note that by (8), we have 

E(*W- - Y g3 f = r 2 [I(I - 1)* 2 + P{I -W- 2)af - 2(1 - l) 2 pa 2 ,] 
= r l (I-l)(a 2 -pal). 

Thus, for all g, we have 

Es 2 Vg = a 2 - pa\. 

Since {s 2 B g } and {s^ 9 } are i.i.d. sequences across the N genes, by the 
central limit theorem, we have 

1 N 

A? E s %9 = ^ + P^ 1 ~ 1 V? + Op(iV~ 1/2 ), 

9=1 

1 N 

9=1 

Therefore, 

a 2 + p(I - l)of - ci2 + pcf + Op^N-V 2 ) 



Po 



a 2 + p(J - l)a\ + (I — l)(o- 2 - p^) + P (iV- 1 /2) 
■■pal/a 2 + Op{N-V 2 ). □ 



Proof of Theorem 3. Note that 
var[r)i(:r)|X] = EE ^ var[^|X] 

+ E E ^ f^V^) ^ f^P) cov[(Z 94 , ^-)|X]. 

Following similar steps in the proof of Theorem 1 , one can verify var \ff\ (x) \ 
X] = V{/I+ (1 - l/f)V£ + opdNh)- 1 ), where the coefficients C 2 , . . . , C 4 , D , 
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-D4 are as follows: 

4(1 + p 2 )a 2 + [4p(J - 2) + 4p 2 (2I - 3)]aj 



C 2 
C 3 



D 



I -I 

j 2 ( y I-3)af + 8(p 2 + p)a 1 a 2 
1-1 



c ± = (/ _ 1) 2 (J _ 2) {( 1 + p 2 )°i + 2 (p 2 + pw - 3 v 2 ^ 



+ (/-3)(/-4)pV} 



_ / , 4p 2(1 + p 2 ) 

1)2 = (J _ 1) 2 (I _ 2) {( / " 3 ) V + (( J - 2 ) 2 + ^ - 2 ( J - 2 )}^ 
Ds = - {I 8 _ {I in f_ 2) {(P 2 + P)n°* + ( J " 4 ) A?}, 



4 



(/-l) 2 (I-2) 2 



4 
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